Microkernel Architecture


+ Each device driver and IO service is a user-mode process


+ No IO goes through the kernel
Software Isolated Processes 


+ Multiple processes live in the same hardware address space, called a domain


+ Processes are strongly isolated from each other using memory and type safety
+ Each process has its own GC heap


System Architecture 


Example: Web App IO Pipeline 


+ With data travelling through many processes, we 
need an efficient data transfer mechanism. 


+ Data must not live on the GC heap 


Shared Heap 


+ Supports bulk data sharing and transfer in the IO path
  + One SharedHeap per domain
  + No relocation
  + No typed objects, pointers, or references
  + Process-level lifetime management and accounting
+ Data can be
  + exclusive: single owner, content is mutable
  + or shared: potentially multiple owners, content is immutable


+ Lifetime is managed by handles, exposed to app code as Stream, SharedData, or SharedMultiSpan


SharedData 


+ Small GC Heap object that represents data on the Shared Heap


+ Manages data lifetime and guarantees no use-after-free

+ Can have multiple segments

+ Zero-copy to concatenate, append headers, or strip off headers
+ Disposable and Finalizable

+ Content is immutable


+ Can be transferred or shared across process boundaries efficiently
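The multi-segment, zero-copy idea can be sketched in standard C#. This is a hypothetical model, not the Midori API: `SegmentedData`, `Append`, and `SkipBytes` are names invented for illustration, and `ReadOnlyMemory<byte>` over ordinary arrays stands in for regions on the Shared Heap.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical model of an immutable, multi-segment buffer (not the Midori API).
public sealed class SegmentedData
{
    private readonly ReadOnlyMemory<byte>[] _segments;

    public SegmentedData(params ReadOnlyMemory<byte>[] segments) => _segments = segments;

    public long Length => _segments.Sum(s => (long)s.Length);
    public int SegmentCount => _segments.Length;

    // Concatenation builds a new segment list; no payload bytes are copied.
    public SegmentedData Append(SegmentedData tail) =>
        new SegmentedData(_segments.Concat(tail._segments).ToArray());

    // Stripping a header only re-slices the segment views.
    public SegmentedData SkipBytes(int count)
    {
        var rest = new List<ReadOnlyMemory<byte>>();
        foreach (var seg in _segments)
        {
            if (count >= seg.Length) { count -= seg.Length; continue; }
            rest.Add(seg.Slice(count));
            count = 0;
        }
        return new SegmentedData(rest.ToArray());
    }

    // Copies the content out; only used here to inspect the result.
    public byte[] ToArray() => _segments.SelectMany(s => s.ToArray()).ToArray();
}
```

Prepending a header is then `new SegmentedData(header).Append(body)`, and removing it again is `SkipBytes(header.Length)`; in both cases only the small descriptor object changes.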


SharedData Implementation


[Figure: a class SharedData object holds struct SharedHeapDescriptor entries; each descriptor points to a class SharedHeapRegion, which has a Finalizer]

Example of Sharing #3


[Figure: diagram not recoverable from the source]

SharedData Transfer 
Only Immutable Data Can Be Shared


+ Avoids data races and TOCTOU bugs
+ Minimizes CPU cache coherency protocol overhead
+ Allows simple and efficient caching


+ Seems limiting, but this restriction turns out to be a great ally in system performance


Unifying Data Access: Span<T> 


+ APIs using Span<byte> can be used to process data on both 
GC Heap and Shared Heap. 


+ Deep runtime and class library integration 


byte[] gcData = ...; 
Span<byte> s1 = gcData.GetSpan(); 
long decodedCharCount1 = Utf8Encoding.GetCharCount(in s1); 


SharedData shData = ...;
Span<byte> s2 = shData.GetLongestSpanAt(0);
long decodedCharCount2 = Utf8Encoding.GetCharCount(in s2);
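The same pattern exists in standard .NET, which gives a way to try the idea outside Midori. The sketch below is an analogy under stated assumptions: `SpanDemo` and `CountChars` are illustrative names, and stack memory from `stackalloc` stands in for the Shared Heap as "memory that is not on the GC heap". `Encoding.UTF8.GetCharCount(ReadOnlySpan<byte>)` is a real .NET API.

```csharp
using System;
using System.Text;

public static class SpanDemo
{
    // One API, written against a span, with no knowledge of where the bytes live.
    public static int CountChars(ReadOnlySpan<byte> utf8) => Encoding.UTF8.GetCharCount(utf8);

    public static (int FromArray, int FromStack) Run()
    {
        // Bytes on the GC heap.
        byte[] gcData = Encoding.UTF8.GetBytes("h\u00e9llo");

        // The same bytes in stack memory (stand-in for non-GC memory).
        Span<byte> offHeap = stackalloc byte[gcData.Length];
        gcData.AsSpan().CopyTo(offHeap);

        return (CountChars(gcData), CountChars(offHeap));
    }
}
```

Both calls go through the identical code path; the caller decides where the data lives, and the callee only sees a span.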


Primitive Structs 


+ Structs composed of only primitive types, no references 
+ APIs to read/write/cast primitive structs 


// Both T and U below are primitive struct types
Span<byte> a = ...;
T t = a.Read<T>(offset);
a.Write<T>(offset, t);

Span<T> b = ...;
Span<U> c = Primitive.Cast<T, U>(b);
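For readers who want to experiment, standard .NET offers close counterparts in `System.Runtime.InteropServices.MemoryMarshal`. The sketch below is an approximation, not the Midori API: the `unmanaged` constraint plays the role of "primitive struct", and `Sample`, `PrimitiveAccess`, `ReadAt`, `WriteAt`, and `CastTo` are hypothetical names.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical primitive struct: unmanaged fields only, fixed layout.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
public struct Sample
{
    public ushort Id;
    public ushort Flags;
}

public static class PrimitiveAccess
{
    // Counterpart of a.Read<T>(offset); throws if the span is too short.
    public static T ReadAt<T>(ReadOnlySpan<byte> bytes, int offset) where T : unmanaged
        => MemoryMarshal.Read<T>(bytes.Slice(offset));

    // Counterpart of a.Write<T>(offset, t); copies the raw bytes of the value.
    public static void WriteAt<T>(Span<byte> bytes, int offset, T value) where T : unmanaged
        => MemoryMarshal.AsBytes(MemoryMarshal.CreateSpan(ref value, 1)).CopyTo(bytes.Slice(offset));

    // Counterpart of Primitive.Cast<T, U>(b); reinterprets the memory without copying.
    public static Span<TTo> CastTo<TFrom, TTo>(Span<TFrom> source)
        where TFrom : unmanaged
        where TTo : unmanaged
        => MemoryMarshal.Cast<TFrom, TTo>(source);
}
```

Because the struct contains no references, reinterpreting bytes as the struct cannot forge a pointer, which is what makes these operations safe to expose.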


Example 
Reading ICMP header from a network packet 


public primitive struct IcmpHeader : IInteroperable
{
    public IcmpType Type;
    public byte Code;
    public BigEndian.UInt16 Checksum;
    public BigEndian.UInt16 Identifier;
    public BigEndian.UInt16 Sequence;
}

SharedData data = ...; // The network packet
IcmpHeader h = data.Read<IcmpHeader>(offset);
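A rough equivalent in standard C# is shown below, assuming the packet is available as a `ReadOnlySpan<byte>`. `IcmpHeaderView` and `Parse` are hypothetical names; `BinaryPrimitives.ReadUInt16BigEndian` is a real .NET API that takes the place of the slide's `BigEndian.UInt16` field type, and the 8-byte layout follows the field order in the struct above.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical read-only view of the 8-byte ICMP header.
public readonly struct IcmpHeaderView
{
    public const int Size = 8;

    public byte Type { get; }
    public byte Code { get; }
    public ushort Checksum { get; }
    public ushort Identifier { get; }
    public ushort Sequence { get; }

    private IcmpHeaderView(byte type, byte code, ushort checksum, ushort identifier, ushort sequence)
    {
        Type = type;
        Code = code;
        Checksum = checksum;
        Identifier = identifier;
        Sequence = sequence;
    }

    // Slice throws if fewer than Size bytes remain at the offset.
    public static IcmpHeaderView Parse(ReadOnlySpan<byte> packet, int offset)
    {
        ReadOnlySpan<byte> h = packet.Slice(offset, Size);
        return new IcmpHeaderView(
            h[0],
            h[1],
            BinaryPrimitives.ReadUInt16BigEndian(h.Slice(2)),  // network byte order
            BinaryPrimitives.ReadUInt16BigEndian(h.Slice(4)),
            BinaryPrimitives.ReadUInt16BigEndian(h.Slice(6)));
    }
}
```

The slide's version avoids even this per-field decoding step by encoding the byte order in the field type, so the struct can be read straight from the buffer.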


Destructible Resource, SharedMultiSpan 


+ Allocated on the stack, not the GC Heap

+ C++ destructor semantics

+ Enforced by the language and compiler

+ No IDisposable, no finalization

+ SharedMultiSpan is otherwise similar to SharedData
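The closest construct in standard C# is a `ref struct` with a pattern-based `Dispose` method (C# 8 and later). The sketch below is an analogy only: `ScopedRegion`, `RegionTracker`, and `ScopedDemo` are hypothetical names, and unlike the destructible resources described above, standard C# releases the value only when the caller opts in with `using`.

```csharp
using System;

// Hypothetical tracker so the example can observe acquisitions and releases.
public sealed class RegionTracker
{
    public int Live;
}

// A ref struct can only live on the stack: the compiler rejects boxing it,
// storing it in a class field, or capturing it in a lambda.
public ref struct ScopedRegion
{
    private readonly RegionTracker _tracker;
    private bool _released;

    public ScopedRegion(RegionTracker tracker)
    {
        _tracker = tracker;
        _released = false;
        _tracker.Live++;
    }

    // Called by `using` at scope exit; no IDisposable, no finalizer.
    public void Dispose()
    {
        if (_released) return;
        _released = true;
        _tracker.Live--;
    }
}

public static class ScopedDemo
{
    // Returns the live count observed while the region is still in scope.
    public static int LiveInsideScope(RegionTracker tracker)
    {
        using var region = new ScopedRegion(tracker);
        return tracker.Live;
    }
}
```

The release point is deterministic and tied to lexical scope, so there is no finalizer queue and no window in which the resource outlives its handle.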


Stream 
+ A potentially large sequence of bytes containing 
unstructured data 


+ Producer and consumer may exist in different processes 
+ May be long lived 
+ Important mechanism to marshal data 


[Figure: Producer, StreamWriter, Stream]


SharedData Soft/Weak Reference 


+ Used primarily in caches


+ Have no methods to access data; to access the data, convert to SharedData first


+ Lifetime rules:

  + If any SharedData references a data region, the region is guaranteed to remain alive

  + If neither SharedData nor SharedDataSoftReferences reference a data region, it is guaranteed to be freed, regardless of whether any SharedDataWeakReferences reference it

  + If no SharedData references a data region but SharedDataSoftReferences do, the region is kept alive in the average case, regardless of whether any SharedDataWeakReferences exist. It is freed only if the kernel runs out of memory and chooses to steal memory from that region.
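The three lifetime rules reduce to a small decision function. The sketch below is a hypothetical model of the rules as stated, not Midori code: `LifetimeRules`, `IsKeptAlive`, and the `kernelReclaims` flag (standing for "the kernel ran out of memory and chose this region") are illustrative names.

```csharp
public static class LifetimeRules
{
    // Returns true if the data region stays alive under the stated rules.
    // weakCount is accepted only to make explicit that it never affects the outcome.
    public static bool IsKeptAlive(int strongCount, int softCount, int weakCount, bool kernelReclaims)
    {
        // Rule 1: any SharedData reference keeps the region alive.
        if (strongCount > 0) return true;

        // Rule 3: soft references keep it alive unless the kernel steals the memory.
        if (softCount > 0) return !kernelReclaims;

        // Rule 2: with no strong or soft references, the region is freed.
        return false;
    }
}
```

For example, a region with only soft references survives normal operation but is dropped under memory pressure, which is the property the caches rely on.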


SharedDataCache 


+ Most data is held through SharedDataSoftReference
+ A small subset of data is held through SharedData
+ Used by large backend caches (e.g. storage stack)


+ Allows the system to use most physical memory for caching without demand paging


SharedDataWeakReferenceCache 


+ Holds only SharedDataWeakReference

+ Works as a front-end cache
  + Reduces IPC overhead of communicating with the backend cache
  + Reduces load on the backend cache


Example:
  + Web server static content cache
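A front-end cache of this shape can be approximated in standard .NET with `WeakReference<T>`. The sketch below is an analogy under stated assumptions: `WeakFrontCache` and its backend delegate are hypothetical, and .NET weak references track GC objects, whereas the weak references above track regions on the Shared Heap that a backend cache keeps alive.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical front-end cache that never keeps content alive by itself.
public sealed class WeakFrontCache
{
    private readonly Dictionary<string, WeakReference<byte[]>> _entries =
        new Dictionary<string, WeakReference<byte[]>>();
    private readonly Func<string, byte[]> _fetchFromBackend;

    // Counts round trips to the backend (the IPC cost the cache tries to avoid).
    public int BackendCalls { get; private set; }

    public WeakFrontCache(Func<string, byte[]> fetchFromBackend) =>
        _fetchFromBackend = fetchFromBackend;

    public byte[] Get(string key)
    {
        // Hit: the data is still alive elsewhere, so no backend call is needed.
        if (_entries.TryGetValue(key, out var weak) && weak.TryGetTarget(out var cached))
            return cached;

        // Miss or expired entry: ask the backend and remember a weak reference.
        BackendCalls++;
        byte[] fresh = _fetchFromBackend(key);
        _entries[key] = new WeakReference<byte[]>(fresh);
        return fresh;
    }
}
```

The front end adds no memory pressure of its own: when the backend lets the data go, the weak entry simply stops resolving and the next request falls through.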


Result: Speech Server Performance


At more than triple the load, Midori delivers customer-visible latency gains
+ Windows Phone beats latency target at 3x load; Xbox at 4x


[Table: benchmark results for Midori and Windows Speech Server; values not recoverable from the source]
* Windows Speech Server cannot complete the benchmark at this concurrency level


Result: SPECWeb05 


[Table: row and column labels not recoverable from the source]
34,800   69,000
70,600   82,500
27,300   47,000
49,461   78,442


System
+ [?]x Quad Core CPU, 48GB RAM, 4x 512GB SSD
+ [?]x 10Gb NIC + 2x 1Gb NIC
+ HyperThreading enabled
+ Logging and HTTPS enabled for both Windows and Midori runs per benchmark rules


Windows Settings
+ Same as official Windows submissions on SPEC.org, except for custom tunings that improve Windows performance, such as affinity, IPSEC and Firewall off


http://midori/Wiki Pages/Comparison of Midori and Windows SpecWeb05 Performance.aspx 
Result: 10Gb Networking Performance


Bidirectional workload (both transmit and receive simultaneously)


+ Windows: more than two CPU cores to saturate the 10Gb wire


+ Midori: less than a single CPU core to saturate the 10Gb wire


System 
Dell T5500, 2.4GHz CPU, Intel 82599 10Gb NIC 